Experiments in Data Compression II
(or "Why couldn't I have thought of this sooner?")
by John Kormylo
E.L.F. Software
State of the Art
The following table shows the results of compressing three
different files using three different methods. 'ZIP' refers to
STZIP 2.5 (deflating). 'OLD' refers to the improved LZA method
documented in the previous report, using a single 16K search
tree. 'BEST' refers to the best results obtained previously,
using three search trees and testing all possible ways to
compress the next 65 characters.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 ZIP   |        25207 |       32444 |       496880 |
 OLD   |        24942 |       34003 |       670774 |
 BEST  |        22196 |       32035 |       455017 |
-------+--------------+-------------+--------------+

Redundancy Weighting
The main inefficiency of the old LZA method is that while
there can be many matches of a given length, only one can be used
at a time. No matter how heavily the entry used is weighted,
there is still lost information in the other possible matches.
Using a hash table instead of a search tree could solve this
problem, but hash tables are inefficient for storage (a 16K
binary search tree with a maximum match length of 65 characters
would require up to 16,384 x 65 = 1,064,960 entries in a hash
table). Also,
once something is put into a hash table it can never be taken out
again (at least, using the implementation documented in "The Data
Compression Book" by Mark Nelson).
However, since the binary search tree has already
alphabetized the "dictionary" of possible matches, it is
relatively easy to use an index corresponding to its position in
alphabetical order. This can be done easily by including in the
node structure a count of how many nodes are in this branch
(actually, the self-balancing functions already require this).
More important, it is also easy to locate the largest and
smallest indexes corresponding to a match of a given length (and
certainly faster than searching for the most heavily weighted
match). The 'weight' for this 'match' is therefore the number of
matches for this length in the search tree (as it turns out, the
arithmetic compression algorithm normally implements weights as a
range of index values). The length (and match indicator) codes
are implemented using character codes 256 through 320, as before.
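
As a sketch of how this works (the node layout and names below
are illustrative assumptions, not the actual E.L.F. code), the
subtree counts let one count the dictionary entries whose prefix
sorts before the current string; calling the count twice, once
excluding and once including prefix ties, brackets the run of
matches of a given length:

    #include <string.h>

    /* Illustrative node layout: each node stores the size of its
     * subtree, which the self-balancing code maintains anyway. */
    typedef struct Node {
        const unsigned char *text;   /* start of this dictionary string */
        struct Node *left, *right;
        int count;                   /* nodes in this subtree, incl. self */
    } Node;

    static int subtree_count(const Node *n) { return n ? n->count : 0; }

    /* Number of entries whose first len bytes sort before key's.
     * Prefix ties are counted when tie_low is 0, skipped when 1. */
    static int rank_prefix(const Node *n, const unsigned char *key,
                           int len, int tie_low)
    {
        int r = 0;
        while (n) {
            int cmp = memcmp(n->text, key, len);
            if (cmp < 0 || (cmp == 0 && !tie_low)) {
                r += subtree_count(n->left) + 1;  /* node + left branch */
                n = n->right;
            } else {
                n = n->left;
            }
        }
        return r;
    }

    /* The index range [lo, hi) covers exactly the matches of length
     * len; hi - lo is the 'weight' handed to the arithmetic coder. */
    static void match_range(const Node *root, const unsigned char *key,
                            int len, int *lo, int *hi)
    {
        *lo = rank_prefix(root, key, len, 1);
        *hi = rank_prefix(root, key, len, 0);
    }
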
When expanding the data, one will get some value in the
range of possible matches. One would then search through the
binary tree to locate the specific dictionary entry. Given the
text pointer from this node, one can then go through the search
tree again and locate the largest and smallest matches for this
length.
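
The rank-to-entry lookup on the expansion side can be sketched
the same way (again using the illustrative Node layout above):

    /* Given a decoded index k (0-based, in alphabetical order),
     * walk the subtree counts down to the k-th smallest entry. */
    static const Node *select_index(const Node *n, int k)
    {
        while (n) {
            int left = subtree_count(n->left);
            if (k < left) {
                n = n->left;            /* k-th entry is in left branch */
            } else if (k == left) {
                return n;               /* this node is the k-th entry */
            } else {
                k -= left + 1;          /* skip left branch and node */
                n = n->right;
            }
        }
        return 0;                       /* index out of range */
    }
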
Obviously, this approach is more efficient than any possible
weighting scheme involving single dictionary entries. Also, this
approach is relatively insensitive to dictionary size, since the
number of matches should increase proportionately as the number
of entries increases (assuming a uniformly random distribution of
matches). Other advantages are that the node structure does not
need to store the weight, last match length or sort index, and
that one need not sort the indexes into weight order.
The main disadvantage is that the expansion algorithm must
also sort the entries into a binary search tree. Not only does
this make the expansion slower, it also requires that the
compression algorithm avoid using information not yet coded when
constructing the search tree. Normally, the compression algorithm
uses all 65 (or
whatever the maximum match length is) future characters to place
a text pointer into the tree. Needless to say, the expansion
algorithm does not have access to these future characters.
Another disadvantage is that one cannot use this approach to
match future characters. When the old LZA method ran into a
sequence of repeated characters (not yet in the dictionary), it
would code the first normally and code the rest by matching the
text pointer just entered into the tree. The expansion algorithm
would generate the entire sequence with a self-referencing copy:

    /* overlapping copy: expands one coded byte into a run of len+1 */
    for (i = 0; i < len; i++) text[i+1] = text[i];

Backwards Matching
Instead of sorting strings in alphabetical order from left
to right, one could sort strings from right to left. The search
tree would then consist of pointers to the ENDS of matches,
rather than the STARTS of matches. Obviously, this approach does
not use future characters, so the compression and expansion
algorithms will automatically keep in step.
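
A minimal sketch of the reversed ordering (the comparator below
is an assumed coding, not the author's routine): pointers address
the byte just past the end of each string, and the comparison
scans leftwards from there:

    /* Compare two strings right to left; p and q each point one
     * byte PAST the end of a string.  Scans at most maxlen bytes. */
    static int cmp_backwards(const unsigned char *p,
                             const unsigned char *q, int maxlen)
    {
        int i;
        for (i = 1; i <= maxlen; i++) {
            if (p[-i] != q[-i])
                return (int)p[-i] - (int)q[-i];
        }
        return 0;   /* equal over the compared span */
    }

Since only bytes already coded are ever examined, the compressor
and expander build identical trees without any lookahead.
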
The disadvantage is that instead of searching through the
tree for the longest match at a given start location, one must
search through the tree for a match of a known length for 65
possible ending locations. (Actually, one can save this
information from one step to the next, but it still requires a
search for every byte in the message instead of only those bytes
not skipped.) This approach is slower than 'OLD' but much faster
than 'BEST'.
The following table shows the results using the longest
match (as 'LONG'), using the match with the fewest bits per
character (as 'AVG') and searching all possible ways to compress
the next 65 bytes (as 'BEST' again).

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        23322 |       34362 |       649317 |
 AVG   |        23043 |       34693 |       660107 |
 BEST  |        22007 |       33276 |       580261 |
-------+--------------+-------------+--------------+

These results were somewhat disappointing. While new records
were set for DEMONSTR.DOC, this method seems fundamentally
incapable of performing well on the other two files. (All of the
above results were for a 16K search tree.)
The next table shows the results from using the above 'AVG'
method for various tree sizes. The results (as compared to those
in the previous report) for DEMONSTR.DOC are what should be
expected when the statistical assumptions (uniformly distributed
matches) are satisfied: decreasing effectiveness as the tree
size decreases and slightly better compression for all tree sizes
(above 256). The results for WORDLIST.TXT are what should be
expected when matches tend to be locally clustered: increasing
effectiveness as the tree size decreases (up to a point) and
slightly better results (for tree sizes above 512). The results
for ELFBACK.PRG possibly indicate a lot of repeated sequences:
less effectiveness for all tree sizes.

 tree  | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
    32 |        45008 |       45812 |       544218 |
    64 |        42188 |       43799 |     > 536383 |
   128 |        38167 |       41574 |       538023 |
   256 |        34187 |       39331 |       548685 |
   512 |        31076 |       37480 |       564429 |
  1024 |        27844 |       36567 |       581900 |
  2048 |        25690 |       35234 |       601221 |
  4096 |        24088 |       34728 |       624977 |
  8192 |        23235 |     > 34571 |       644742 |
 16384 |      > 23043 |       34693 |       660107 |
-------+--------------+-------------+--------------+
('>' marks the best result in each column.)

FIFO Buffer
An alternative solution is to not add match locations to the
search tree until all 65 characters have been coded. This can be
implemented using a 64-entry FIFO (First-In/First-Out) buffer to
hold these new locations. In fact, one can use this buffer as an
alternative search tree, similar to the multiple search tree
methods explored in the previous report. This should also
improve performance on files with locally clustered matches.
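
A minimal sketch of such a buffer (the interface is an assumption
for illustration): each newly coded position is pushed, and the
evicted position, now 64 steps old, is safe to insert into the
main search tree:

    #define FIFO_SIZE 64

    typedef struct {
        long pos[FIFO_SIZE];   /* text positions awaiting insertion */
        int  head, count;
    } Fifo;

    /* Push a newly coded position; returns the evicted position
     * (now old enough for the main tree) or -1 while filling. */
    static long fifo_push(Fifo *f, long position)
    {
        long evicted = -1;
        if (f->count == FIFO_SIZE) {
            evicted = f->pos[f->head];
            f->head = (f->head + 1) % FIFO_SIZE;
            f->count--;
        }
        f->pos[(f->head + f->count) % FIFO_SIZE] = position;
        f->count++;
        return evicted;
    }

Searching the 64 pending positions linearly then serves as the
second, 'recent' dictionary.
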
The following table shows the results for this approach
using a 64-entry tree (unweighted) and a 16K-entry tree (weighted
by the number of matches). Again, using the longest match is
labeled as 'LONG', using the fewest bits/character is labeled as
'AVG', and using the (nearly) optimal selections as 'BEST'.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24618 |       33418 |       597383 |
 AVG   |        23991 |       33269 |       555619 |
 BEST  |        21893 |       32147 |       496517 |
-------+--------------+-------------+--------------+

Both the LONG and AVG algorithms outperformed the OLD algorithm
on all three files. However, the new BEST algorithm performed
better on only one of the files.
It should be noted that entries in the FIFO buffer can be
flushed to the main dictionary as soon as all 65 bytes are
transmitted, effectively reducing the size of the FIFO buffer.
However, as shown in the next table, this doesn't buy you much.
In fact, it is worse on two of the three files:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24630 |       33544 |       599469 |
 AVG   |        24082 |       33466 |       540713 |
 BEST  |        22034 |       32345 |       473610 |
-------+--------------+-------------+--------------+

On the other hand, increasing the FIFO buffer to 128 entries
doesn't help either.
FIFO Weighting
The reason the FIFO buffer approach works better than
backwards sorting is that the statistical assumption of a uniform
random distribution of matches is not valid for some files. In
this case the probability of finding a match in the 64-entry FIFO
buffer is much greater than the number of equivalent matches in
the main dictionary divided by the total number of entries.
That being the case, it would also seem reasonable that the
probability of finding a match in the most recent half of the
FIFO buffer is greater than in the older half, etc. One could
assign a weight associated with each entry in the FIFO buffer
based on how long it has been there, and use those weights when
coding the index. These weights would, of course, consist of
(scaled) counts of how many times this 'lag' has been used
(initially set to 1). Since there are only 64 entries, this
should not impose much additional computation cost.
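
A sketch of this weighting (the rescaling detail is an
assumption; the report only says the counts are scaled and start
at 1):

    #define FIFO_SIZE 64

    /* lag_count[k] counts how often lag k has been used; all start
     * at 1 so every lag keeps a nonzero coding range. */
    static unsigned int lag_count[FIFO_SIZE];
    static unsigned int lag_total;

    static void lag_init(void)
    {
        int k;
        for (k = 0; k < FIFO_SIZE; k++) lag_count[k] = 1;
        lag_total = FIFO_SIZE;
    }

    /* Cumulative range [*low, *high) out of *total for coding the
     * (0-based) lag, then bump its count so the weights adapt. */
    static void lag_code(int lag, unsigned int *low,
                         unsigned int *high, unsigned int *total)
    {
        unsigned int below = 0;
        int k;
        for (k = 0; k < lag; k++) below += lag_count[k];
        *low   = below;
        *high  = below + lag_count[lag];
        *total = lag_total;
        lag_count[lag]++;
        lag_total++;
        /* a real coder would rescale (e.g. halve) the counts when
         * lag_total approaches the arithmetic coder's limit */
    }
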
The next table shows the results for using these weights for
the FIFO buffer:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24620 |       33375 |       583218 |
 AVG   |        23991 |       33207 |       530798 |
 BEST  |        22022 |       32088 |       461457 |
-------+--------------+-------------+--------------+

As expected, these results were not quite as good for
DEMONSTR.DOC, but better for the other two files. A list of the
weights generated using the 'AVG' approach is shown in the
Appendix. (It should be pointed out that this is not strictly
speaking a 'lag' in the time series sense, since each increment
can mean anywhere from 1 to 65 bytes.)
Break Even
Another advantage of redundancy weighting is that it does a
much better job for smaller matches. (One could theoretically
use it for single characters, except that you would still need
some way to get the dictionary started.) However, early testing
indicated that the best results were obtained by never using
matches of length 2 and always using matches of length 3 (as
opposed to testing for break even as was done using the 'OLD'
LZA algorithm).
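
For reference, the kind of break-even test the 'OLD' algorithm
applied can be sketched as follows (the bit-cost inputs are
assumed helpers, not the author's interface):

    /* Hypothetical break-even test: accept a match only if it codes
     * in fewer bits than its characters coded separately. */
    static int worth_matching(double match_bits,
                              const double *char_bits, int len)
    {
        double literal_bits = 0.0;
        int i;
        for (i = 0; i < len; i++)
            literal_bits += char_bits[i];
        return match_bits < literal_bits;
    }
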
The next table shows the results of using length 2 matches
with FIFO weighting:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24860 |       34719 |       585185 |
 AVG   |        24054 |       34394 |       520875 |
 BEST  |        22029 |       31997 |       453143 |
-------+--------------+-------------+--------------+

Comparing this to the previous table, we see that including
length 2 matches only works well with the 'BEST' algorithm (which
implicitly contains a break even test). More importantly, this
allows the new 'BEST' to beat the old 'BEST' (in the first table)
on all three files.
Appendix - Normalized Weights (count / total * 1000)

 lag   | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
     1 |            3 |          21 |           12 |
     2 |            5 |          63 |          251 |
     3 |           14 |          23 |          109 |
     4 |            8 |          66 |           88 |
     5 |           18 |          33 |           59 |
     6 |           24 |          37 |           48 |
     7 |           13 |          24 |           35 |
     8 |           19 |          42 |           31 |
     9 |           21 |          28 |           26 |
    10 |            8 |          35 |           23 |
    11 |           21 |          22 |           19 |
    12 |           20 |          26 |           16 |
    13 |           33 |          20 |           15 |
    14 |           24 |          24 |           13 |
    15 |           21 |          18 |           13 |
    16 |           19 |          17 |           12 |
    17 |           26 |          18 |           10 |
    18 |           25 |          16 |           10 |
    19 |            9 |          18 |           10 |
    20 |           25 |          15 |            9 |
    21 |           15 |          18 |            8 |
    22 |           17 |          17 |            8 |
    23 |           14 |          18 |            7 |
    24 |           28 |          15 |            7 |
    25 |           22 |          13 |            7 |
    26 |           17 |          13 |            6 |
    27 |           23 |          20 |            6 |
    28 |           16 |          15 |            6 |
    29 |           17 |          14 |            6 |
    30 |           17 |           9 |            5 |
    31 |           18 |          14 |            6 |
    32 |           14 |          11 |            5 |
    33 |           34 |          11 |            5 |
    34 |           12 |           8 |            5 |
    35 |           26 |          12 |            5 |
    36 |           13 |          10 |            4 |
    37 |           17 |           9 |            5 |
    38 |           17 |          13 |            5 |
    39 |           20 |           7 |            4 |
    40 |           13 |          10 |            4 |
    41 |           16 |          10 |            4 |
    42 |           11 |           9 |            4 |
    43 |            9 |           7 |            4 |
    44 |            9 |           8 |            4 |
    45 |           15 |           3 |            4 |
    46 |           13 |           4 |            3 |
    47 |           18 |           8 |            3 |
    48 |            7 |           9 |            3 |
    49 |           14 |           5 |            3 |
    50 |            9 |           9 |            3 |
    51 |           15 |           9 |            3 |
    52 |           14 |           5 |            3 |
    53 |           20 |           8 |            3 |
    54 |            9 |           6 |            3 |
    55 |            8 |           6 |            3 |
    56 |            8 |           9 |            3 |
    57 |           13 |           5 |            3 |
    58 |           10 |          11 |            3 |
    59 |            4 |          13 |            3 |
    60 |           15 |           9 |            2 |
    61 |           12 |           4 |            2 |
    62 |            7 |           9 |            3 |
    63 |            8 |           8 |            3 |
    64 |            9 |           7 |            2 |
-------+--------------+-------------+--------------+